This exercise attempts Question 1 of Mini-Challenge 2 from the VAST Challenge 2023. It focuses on using visual analytics to identify anomalies among the business groups in the knowledge graph.

In the code chunk below, the R packages needed for data import, data preparation and visualization are installed (if necessary) and loaded.

```r
pacman::p_load(jsonlite, tidygraph, ggraph, dplyr,
               visNetwork, graphlayouts, ggforce, knitr, kableExtra,
               skimr, tidytext, tidyverse, igraph, ggplot2, RColorBrewer, wordcloud)
```
In the code chunk below, the fromJSON() function is used to import the MC3 data into the R environment.
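A minimal sketch of this step, using an inline JSON string in place of the actual MC3 file (whose path and exact field names are assumptions here):

```r
library(jsonlite)

# Inline JSON stands in for the MC3 file; the real call would read the
# challenge file from disk, e.g. fromJSON("data/MC3.json") (path assumed)
toy_json <- '{"nodes":[{"id":"A","type":"Company"}],
              "links":[{"source":"A","target":"B","type":"ownership"}]}'
toy_data <- fromJSON(toy_json)

# fromJSON() simplifies JSON arrays of objects into data frames
toy_data$nodes   # data frame with columns id, type
toy_data$links   # data frame with columns source, target, type
```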
The code chunk below extracts the links from mc3_data and stores them in a tibble data frame called "mc3_edges".
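The edge-extraction step can be sketched as follows on a toy stand-in for mc3_data$links; the field names (source, target, type) and the aggregation into a weights column are assumptions based on the rest of the write-up:

```r
library(dplyr)
library(tibble)

# Toy stand-in for mc3_data$links
links <- tibble(
  source = c("Alpha Co", "Alpha Co", "Bob"),
  target = c("Bob",      "Bob",      "Carol"),
  type   = c("ownership", "ownership", "contact")
)

mc3_edges_demo <- links %>%
  group_by(source, target, type) %>%
  summarise(weights = n(), .groups = "drop") %>%  # count repeated links
  filter(source != target)                        # drop self-loops

mc3_edges_demo
```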
The code chunk below extracts the nodes from mc3_data and stores them in a tibble data frame called "mc3_nodes".
In the code chunk below, the datatable() function is used to display the mc3_edges and mc3_nodes tibble data frames as interactive tables.
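datatable() here presumably comes from the DT package (which is not in the p_load() call above); a minimal sketch on a built-in data set, since mc3_edges itself is not reproduced:

```r
library(DT)

# datatable() renders a data frame as an interactive, paginated HTML table
datatable(head(mtcars), options = list(pageLength = 5))
```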
In this code chunk, the unnest_tokens() function is used to split the words in the "product_services" column.
The stop_words dataset from the tidytext package is then used to remove stop words.
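A sketch of the tokenisation and stop-word removal on a toy node table (the real column is product_services in mc3_nodes):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for mc3_nodes
nodes_demo <- tibble(
  id = c("n1", "n2"),
  product_services = c("fresh frozen fish and seafood",
                       "transport of canned tuna")
)

token_nodes <- nodes_demo %>%
  unnest_tokens(word, product_services) %>%  # one row per word, lower-cased
  anti_join(stop_words, by = "word")         # drops "and", "of", ...

token_nodes$word
```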
In the code chunk below, a new nodes data table is prepared using the source and target fields of the mc3_edges data frame.
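One common way to do this (a sketch on toy data; the exact column names are assumed) is to stack the source and target columns and keep the distinct ids:

```r
library(dplyr)
library(tibble)

# Toy stand-in for mc3_edges
edges_demo <- tibble(source = c("A", "A"), target = c("B", "C"))

id1 <- edges_demo %>% select(source) %>% rename(id = source)
id2 <- edges_demo %>% select(target) %>% rename(id = target)
nodes_from_edges <- rbind(id1, id2) %>% distinct()

nodes_from_edges$id  # "A" "B" "C"
```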
In this code chunk, a tidygraph data model is created.
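The graph model is typically built with tbl_graph(); a minimal sketch with toy node and edge tables (column names assumed):

```r
library(tidygraph)
library(tibble)

nodes_demo <- tibble(id = c("A", "B", "C"))
edges_demo <- tibble(from = c("A", "B"), to = c("B", "C"))

# node_key tells tbl_graph() which node column the character
# from/to endpoints should be matched against
g_demo <- tbl_graph(nodes = nodes_demo, edges = edges_demo,
                    directed = FALSE, node_key = "id")
g_demo
```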
Betweenness centrality and degree centrality are used to quantify each node's interaction in the network and to visualize those interactions. In the code chunk below, the mean and quartiles of betweenness centrality and degree centrality are calculated.
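The two centrality columns summarised below can be computed with tidygraph's centrality functions; a sketch on a toy ring graph (mc3_df is assumed to be the node table carrying these columns):

```r
library(tidygraph)
library(dplyr)

g_demo <- create_ring(10, directed = FALSE) %>%
  activate(nodes) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         degree_centrality = centrality_degree())

# Extract the node table with its centrality columns
mc3_df_demo <- g_demo %>% as_tibble()
head(mc3_df_demo)
```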
```r
# Calculate mean and quartiles
mean_betweenness <- mean(mc3_df$betweenness_centrality, na.rm = TRUE)
quartiles_betweenness <- quantile(mc3_df$betweenness_centrality,
                                  probs = c(0.5, 0.9, 0.95, 0.99), na.rm = TRUE)
max_betweenness <- max(mc3_df$betweenness_centrality, na.rm = TRUE)

mean_degree <- mean(mc3_df$degree_centrality, na.rm = TRUE)
quartiles_degree <- quantile(mc3_df$degree_centrality,
                             probs = c(0.50, 0.90, 0.95), na.rm = TRUE)
max_degree <- max(mc3_df$degree_centrality, na.rm = TRUE)
```

After that, the results are stored in the respective data frames.
```r
# Create a data frame for betweenness centrality
betweenness_df <- data.frame(
  Measure = c("Mean", "Quartile 0.50 (Median)", "Quartile 0.90",
              "Quartile 0.95", "Quartile 0.99", "Maximum"),
  Value = c(mean_betweenness, quartiles_betweenness, max_betweenness)
)

# Create a data frame for degree centrality
degree_df <- data.frame(
  Measure = c("Mean", "Quartile 0.50 (Median)", "Quartile 0.90",
              "Quartile 0.95", "Maximum"),
  Value = c(mean_degree, quartiles_degree, max_degree)
)
```

**Betweenness Centrality Measurement**
| Measure | Value |
|---|---|
| Mean | 8393.619 |
| Quartile 0.50 (Median) | 0.000 |
| Quartile 0.90 | 3.000 |
| Quartile 0.95 | 176.400 |
| Quartile 0.99 | 165050.000 |
| Maximum | 3849384.703 |
**Degree Centrality Measurement**

| Measure | Value |
|---|---|
| Mean | 1.287965 |
| Quartile 0.50 (Median) | 1.000000 |
| Quartile 0.90 | 2.000000 |
| Quartile 0.95 | 3.000000 |
| Maximum | 120.000000 |
An interactive plot showing the nodes and edges with the highest betweenness centrality will be created.
For this, in this code chunk, edges with a betweenness_centrality value over 100,000 are filtered and stored in a new data frame.
In this code chunk, nodes with a betweenness_centrality value over 100,000 are filtered and stored in a new data frame.
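The filtering step is a plain dplyr filter; a sketch on toy data (the data frame and column names mirror those used elsewhere in this write-up):

```r
library(dplyr)
library(tibble)

nodes_demo <- tibble(
  id = c("A", "B", "C"),
  betweenness_centrality = c(3849384.7, 165050.0, 176.4)
)

# Keep only nodes above the 100,000 threshold
nodes_top_betweenness_demo <- nodes_demo %>%
  filter(betweenness_centrality > 100000)

nodes_top_betweenness_demo$id  # "A" "B"
```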
Then, an interactive network graph is plotted showing the interactions of business groups with a betweenness centrality value higher than 100,000. The nodes are colored according to their business type.
```r
nodes_top_betweenness <- nodes_top_betweenness %>%
  mutate(group = ifelse(is.na(type), "NA", type))

vis_plot <- visNetwork(nodes_top_betweenness, edges_top_betweenness,
                       main = "Betweenness Centrality") %>%
  visIgraphLayout(layout = "layout_with_kk", smooth = FALSE,
                  physics = TRUE) %>%
  visNodes(size = 50, label = nodes_top_betweenness$label) %>%
  visEdges(color = list(highlight = "lightgray"), arrows = 'to') %>%
  visOptions(selectedBy = "type",
             highlightNearest = list(enabled = TRUE,
                                     degree = 1,
                                     hover = TRUE,
                                     labelOnly = TRUE),
             nodesIdSelection = TRUE) %>%
  visGroups(groupname = "Company", color = "lightblue") %>%
  visGroups(groupname = "Company Contacts", color = "salmon") %>%
  visGroups(groupname = "Beneficial Owner", color = "yellow") %>%
  visGroups(groupname = "NA", color = "grey") %>%
  visLegend(width = 0.1) %>%
  visPhysics(repulsion = list(springLength = 50),  # vis.js option names are case-sensitive
             maxVelocity = 2,
             solver = "forceAtlas2Based",
             forceAtlas2Based = list(gravitationalConstant = -1000),
             timestep = 0.25) %>%
  visLayout(randomSeed = 4)

vis_plot
```

An interactive plot showing the nodes and edges in the top 95th percentile of degree_centrality will be created.
In this code chunk, edges with a degree_centrality value greater than or equal to 3 are filtered and stored in a new data frame.
In this code chunk, nodes with a degree_centrality value greater than or equal to 3 are filtered and stored in a new data frame.
Then, an interactive network graph is plotted showing the interactions of business groups with a degree centrality greater than or equal to 3. The nodes are colored according to their business type. In the plot, we can see a few company owners who interacted with several different business groups.
```r
nodes_top_degree <- nodes_top_degree %>%
  mutate(group = ifelse(is.na(type), "NA", type))

vis_plot_2 <- visNetwork(nodes_top_degree, edges_top_degree,
                         main = "Degree Centrality") %>%
  visIgraphLayout(layout = "layout_with_kk", smooth = FALSE,
                  physics = TRUE) %>%
  visNodes(size = 50, label = nodes_top_degree$label) %>%
  visEdges(color = list(highlight = "lightgray"), arrows = 'to') %>%
  visOptions(selectedBy = "type",
             highlightNearest = list(enabled = TRUE,
                                     degree = 1,
                                     hover = TRUE,
                                     labelOnly = TRUE),
             nodesIdSelection = TRUE) %>%
  visGroups(groupname = "Company", color = "lightblue") %>%
  visGroups(groupname = "Company Contacts", color = "salmon") %>%
  visGroups(groupname = "Beneficial Owner", color = "yellow") %>%
  visGroups(groupname = "NA", color = "grey") %>%
  visLegend(width = 0.1) %>%
  visPhysics(repulsion = list(springLength = 50),  # vis.js option names are case-sensitive
             maxVelocity = 2,
             solver = "forceAtlas2Based",
             forceAtlas2Based = list(gravitationalConstant = -1000),
             timestep = 0.25) %>%
  visLayout(randomSeed = 4)

vis_plot_2
```

In the code chunk below, the count of companies owned by each Company Owner is calculated and shown in a data table.
The mean and quartiles of the company ownership counts are calculated, and owners with higher-than-average ownership (more than 5 companies) are stored in a list called "target list".
```r
mean_counts <- mean(owner_count$count, na.rm = TRUE)
quartiles_counts <- quantile(owner_count$count, probs = c(0.5, 0.95), na.rm = TRUE)

# Create a data frame for the ownership counts
count_df <- data.frame(
  Measure = c("Mean", "Quartile 0.50 (Median)", "Quartile 0.95"),
  Value = c(mean_counts, quartiles_counts)
)

count_df %>%
  kbl() %>%
  kable_paper("hover", full_width = F) %>%
  row_spec(0, bold = T, color = "white", background = "#D7261E")
```

|     | Measure | Value |
|---|---|---|
|     | Mean | 3.472574 |
| 50% | Quartile 0.50 (Median) | 3.000000 |
| 95% | Quartile 0.95 | 5.000000 |
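The target-list step described above can be sketched as follows on a toy owner_count table (column names assumed):

```r
library(dplyr)
library(tibble)

# Toy stand-in for owner_count: companies held per owner
owner_count_demo <- tibble(
  source = c("owner1", "owner2", "owner3"),
  count  = c(8, 5, 2)
)

# Keep owners holding more than 5 companies ("target list")
target_list_demo <- owner_count_demo %>%
  filter(count > 5) %>%
  pull(source)

target_list_demo  # "owner1"
```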
A new data frame that only includes data related to the target list is created.
A new nodes data table is created using the source and target fields of the "top_owner_edges" data frame.
The code chunk below is used to build another new graph data model.
Next, in the code chunk below, we extract and convert the edges and nodes into tibble data frames.
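With tidygraph, this extraction uses activate() to choose which part of the graph as_tibble() returns; a sketch on a toy ring graph:

```r
library(tidygraph)
library(dplyr)

g_demo <- create_ring(4, directed = FALSE)

# activate() selects the component that as_tibble() extracts
edges_demo <- g_demo %>% activate(edges) %>% as_tibble()
nodes_demo <- g_demo %>% activate(nodes) %>% as_tibble()

edges_demo  # from/to columns, one row per edge
```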
An interactive network graph is built to visualize the interactions of company owners with more than 5 company ownership.
In the code chunk below, a new data frame called "top_owner_words" is created. This data frame will be used to identify the top products and services traded among the top company owners.
Next, a word cloud is created to visualize the most traded products among the company owners. As seen in the word cloud, seafood, fish, tuna and salmon are the most traded products.
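A sketch of such a word cloud with the wordcloud package, using toy frequencies standing in for the top_owner_words counts:

```r
library(wordcloud)
library(RColorBrewer)

# Toy word frequencies (assumed; the real counts come from top_owner_words)
words <- c("seafood", "fish", "tuna", "salmon", "transport")
freqs <- c(50, 42, 30, 22, 10)

set.seed(1234)  # the layout is randomised; fix the seed for reproducibility
wordcloud(words = words, freq = freqs, min.freq = 1,
          colors = brewer.pal(5, "Dark2"))
```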
In conclusion, this exercise provided insights into the network of company ownership, allowing us to identify influential owners based on centrality measures and explore the common words associated with the companies. The combination of network analysis and text mining techniques helped us gain a deeper understanding of the relationships and patterns within the data.